Introduction

Goal

We are looking to understand the relationship between the major a student chooses and that student's self-assessment of the activities they engage in, their values, and their perceived competencies, in addition to their demographic characteristics (which could potentially serve as latent variables for cultural norms around STEM desirability).


Per the prompt:

Knowing more about the characteristics of students who choose STEM majors can help us predict how many students will choose those majors and support the students in those majors.

Assumptions Made

  • That the surveys are based on a 'probability sample' (i.e. surveys where all students in our population have a chance of being selected and the probability of selection is known).
    • I believe this to be a safe assumption because the survey went out to all students.
  • That the survey questions had already been validated for use with this population (i.e. the scales have been validated).

Design Decisions

At the end of this analysis, I wanted to produce a model that is reproducible and extendable to other datasets.

  • With reproducibility in mind, I chose to perform the analysis within R and to document the process as well as the findings within RMarkdown.
  • Where possible, I have sought to make this analysis portable and extendable through the project folder structure and code structure.

Investigation Reason

If we can predict students' majors and understand the driving factors, a few actions can be taken from the model:

  1. Looking at which characteristics are "valued" within the model:
    • "Optimize enrollment": i.e. accept more/fewer students who meet the model characteristics
    • Adjust resources based upon currently available data (e.g. we might see a bubble of STEM students coming, which might require more lab space)
  2. Looking at which characteristics are not "valued" within the model:
    • Interview students who would likely have been within STEM but, for some reason, aren't

EDA

Data Overview

Data Source

Our data for this analysis came from two different sources:

  1. The survey results provided by the IR team.
    • Students of the 2012, 2013, 2014, and 2015 freshman cohorts were surveyed during their second semester of their first year.
  2. The factbooks for the corresponding years

Note: the factbooks were compiled from the Wake Forest Office of Institutional Research site Link


Sample of the data that was extracted from the factbooks:


Data imported and additional factors created

  • For this dataset I decided to transform the scales of the survey questionnaire to categorical data because I did not want to assume, later on, a consistent and proportional increase in value across the scale.
  • In reviewing the Race/Ethnicity distribution both within the fact book and the survey response data, I found that ~90% of all students reported their ethnicity to be either White, Asian, or Hispanic/LatinX. Therefore I collapsed the remaining ethnicities into an "Other" group.
  • I also decided to create a new variable called STEM, which outputs YES/NO depending on whether the student's major falls within the STEM category. This is beneficial when developing the initial model, which seeks to predict whether a student's major is STEM or non-STEM.
# Load data
suppressWarnings(suppressMessages(
WFU_stu <- readr::read_csv("./Data/data_for_interviewees.csv") %>% 
  mutate_if(is.numeric,as.factor) %>%
  mutate(GENDER = trimws(toupper(GENDER)),
         STEM = case_when(MAJOR == 6 ~"YES-STEM", TRUE ~ "NO-STEM"),
         STEM = as.factor(STEM),
         FIRST_GEN = fct_collapse(FIRST_GEN, "YES" = c("2"), "NO" = c("1"))) %>%
  mutate(RACEgroup = fct_collapse(RACETHN, "Other" = c("1", "3", "5", "7"),
                                              "Asian" = c("2"),
                                              "LatinX" = c("4"),
                                              "White" = c("6"))) %>% 
  mutate(COLLEGE_GPA = fct_collapse(COLLEGE_GPA, "D or lower" = c("1"),
                                    "C" = c("2"),
                                    "C+" = c("3"),
                                    "B-" = c("4"),
                                    "B" = c("5"),
                                    "B+" = c("6"),
                                    "A-" = c("7"),
                                    "A or A+" = c("8"))) %>%  
  mutate(HS_GPA = fct_collapse(HS_GPA, "D or lower" = c("1"),
                                    "C" = c("2"),
                                    "C+" = c("3"),
                                    "B-" = c("4"),
                                    "B" = c("5"),
                                    "B+" = c("6"),
                                    "A-" = c("7"),
                                    "A or A+" = c("8"))) %>% 
  mutate(DISTANCE = fct_collapse(DISTANCE, "5 or less" = c("1"),
                                    "6 - 10" = c("2"),
                                    "11 - 50" = c("3"),
                                    "51 - 100" = c("4"),
                                    "101 - 500" = c("5"),
                                    "Over 500" = c("6"))) %>%
  ungroup()
))


EthLook <- readxl::read_excel("./Data/DataFromFactBooks.xlsx", sheet = "LookUp")

suppressWarnings(suppressMessages(
FB <- readxl::read_excel("./Data/DataFromFactBooks.xlsx", sheet = "FC_Data") %>% 
  gather(3:9, key= "year",value="StuCount") %>% 
  mutate(year = as.numeric(year),
         Gender= trimws(toupper(Gender))) %>% 
  mutate_if(is.numeric, replace_na, 0) %>% 
  left_join(., EthLook) %>% 
  mutate(ID = as_factor(ID),
    RACEgroup = fct_collapse(ID, "Other" = c("-1","1", "3", "5", "7"),
                                              "Asian" = c("2"),
                                              "LatinX" = c("4"),
                                              "White" = c("6")))
))


suppressWarnings(suppressMessages(
WFU_stu %>% group_by(YEAR) %>% summarise(CountStu_response = n()) %>% 
  left_join(., 
            (FB %>% mutate(year = as_factor(year)) %>% group_by(year) %>% summarise(TotalStu = sum(StuCount, na.rm=TRUE))),
            by= c("YEAR" = "year")) %>% 
  mutate(PercentResponse = CountStu_response/TotalStu) %>% 
  mutate(PercentResponse = scales::percent(PercentResponse)) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
))
YEAR CountStu_response TotalStu PercentResponse
2012 513 1234 41.6%
2013 594 1230 48.3%
2014 395 1287 30.7%
2015 498 1284 38.8%

Note for future study: what happened in 2014 such that the response rate was substantially lower than in the other years?

Overview of the provided dataset

visdat::vis_dat(WFU_stu, sort_type = FALSE, palette = "cb_safe") + 
  coord_flip() + 
  theme_bw()

visdat::vis_miss(WFU_stu) + coord_flip() + theme_bw()

Demographic Comparison

Compare survey response to student population using WFU fact book

The first thing we want to look at is whether or not the survey respondents are in line with the student population during the same period of time.

I decided to use a Chi-square goodness-of-fit test, which is commonly used for comparing categorical data. In our case we are testing whether the sample proportions match the population proportions (our null hypothesis). If the p-value of the chi-square test is very small, then we reject the null hypothesis and say that the two distributions are different.
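As a minimal sketch of the test being applied here (with made-up counts, not the survey data), `chisq.test()` compares observed counts against a vector of hypothesized population proportions:

```r
# Goodness-of-fit sketch: do observed sample counts follow given population proportions?
observed <- c(FEMALE = 120, MALE = 80)  # hypothetical respondent counts
pop_prop <- c(0.53, 0.47)               # hypothetical population proportions

gof <- chisq.test(observed, p = pop_prop)
gof$p.value  # a very small p-value would mean the sample does not match the population
```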

Chi-Square: Gender + Race/Ethnicity

Looking into the comparison of gender and ethnicity

WFU_Resp <- WFU_stu %>% 
  group_by(GENDER, RACEgroup) %>% summarise(CountStu_Samp = n()) %>% 
  ungroup() %>% 
  mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))


TotalPop <- FB %>% filter(year <= 2015) %>% group_by(Gender, RACEgroup) %>% 
  summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))

ComparSamp <- left_join(TotalPop, WFU_Resp,
                        by = c("Gender" = "GENDER", "RACEgroup" = "RACEgroup")) %>% 
  mutate_if(is.numeric, replace_na, 0) %>%
  mutate(NonRep_wt = Pop_Perc/Samp_Perc)

ComparSamp %>%
  round_df(2) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Gender RACEgroup CountStu_Pop Pop_Perc CountStu_Samp Samp_Perc NonRep_wt
FEMALE Other 247 0.05 115 0.06 0.85
FEMALE Asian 300 0.06 145 0.07 0.82
FEMALE LatinX 199 0.04 118 0.06 0.67
FEMALE White 1903 0.38 821 0.41 0.92
MALE Other 228 0.05 83 0.04 1.09
MALE Asian 242 0.05 99 0.05 0.97
MALE LatinX 155 0.03 61 0.03 1.01
MALE White 1761 0.35 558 0.28 1.25
chisq.test(ComparSamp$CountStu_Samp, p = ComparSamp$Pop_Perc)
## 
##  Chi-squared test for given probabilities
## 
## data:  ComparSamp$CountStu_Samp
## X-squared = 62.656, df = 7, p-value = 4.442e-11
## with a p-value ~0, means that the sample distribution doesn't match the population    



GEN_ETH_wt <- ComparSamp %>% select(Gender, RACEgroup, NonRep_wt)

Here, as with the two tests below, we find that the sample response rate does not represent the student population.

Chi-Square: Gender

Performing a chi-square test on the response counts of different genders

#chi square test for gender...
WFU_Resp <- WFU_stu %>% 
  group_by(GENDER) %>% summarise(CountStu_Samp = n()) %>% 
  ungroup() %>% 
  mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))

TotalPop <- FB %>% filter(year <= 2015) %>% group_by(Gender) %>% 
  summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))

ComparSamp <- left_join(TotalPop, WFU_Resp,
                        by = c("Gender" = "GENDER")) %>% 
  mutate_if(is.numeric, replace_na, 0) %>% 
  mutate(NonRep_wt = Pop_Perc/Samp_Perc)

ComparSamp %>%
  round_df(2) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Gender CountStu_Pop Pop_Perc CountStu_Samp Samp_Perc NonRep_wt
FEMALE 2649 0.53 1199 0.6 0.88
MALE 2386 0.47 801 0.4 1.18
chisq.test(ComparSamp$CountStu_Samp, p= ComparSamp$Pop_Perc)
## 
##  Chi-squared test for given probabilities
## 
## data:  ComparSamp$CountStu_Samp
## X-squared = 43.198, df = 1, p-value = 4.947e-11
## with a p-value ~0, means that the sample distribution doesn't match the population    

With the low p-value from the Chi-square test, we reject the null hypothesis and determine that the survey responses do not match the gender proportions of the student population.

Chi-Square: Race/Ethnicity

Looking at the ethnicity comparison

WFU_Resp <- WFU_stu %>% 
  group_by(RACEgroup) %>% summarise(CountStu_Samp = n()) %>% 
  ungroup() %>% 
  mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))


TotalPop <- FB %>% filter(year <= 2015) %>% group_by(RACEgroup) %>% 
  summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))

ComparSamp <- left_join(TotalPop, WFU_Resp,
                        by = c("RACEgroup" = "RACEgroup")) %>% 
  mutate_if(is.numeric, replace_na, 0) %>%
  mutate(NonRep_wt = Pop_Perc/Samp_Perc)

ComparSamp %>%
  round_df(2) %>%
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
RACEgroup CountStu_Pop Pop_Perc CountStu_Samp Samp_Perc NonRep_wt
Other 475 0.09 198 0.10 0.95
Asian 542 0.11 244 0.12 0.88
LatinX 354 0.07 179 0.09 0.79
White 3664 0.73 1379 0.69 1.06
chisq.test(ComparSamp$CountStu_Samp, p = ComparSamp$Pop_Perc)
## 
##  Chi-squared test for given probabilities
## 
## data:  ComparSamp$CountStu_Samp
## X-squared = 18.778, df = 3, p-value = 0.0003039

The Chi-square test indicates that the survey responses do not match the proportions of the student population for ethnicity.

Non-response weights

By comparing the sample proportions to the population within the factbooks via a Chi-Square test we find that the sample does not align with the population.

This misalignment is likely due to non-response. From calculating the non-response weight (NonRep_wt) we see that:

  • Males of every ethnicity except Asian responded at a lower rate than their share of the population (weight > 1)
  • Females of every ethnicity responded at a higher rate than their share of the population (weight < 1)

We will use these weights when reviewing our descriptive stats.
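The weight calculation itself reduces to a single division; using the gender counts from the fact book and survey tables above:

```r
library(dplyr)

# Non-response weight = population share / sample share,
# so an under-represented group receives a weight > 1.
pop  <- tibble(Gender = c("FEMALE", "MALE"), CountStu_Pop  = c(2649, 2386))
samp <- tibble(Gender = c("FEMALE", "MALE"), CountStu_Samp = c(1199, 801))

wts <- left_join(pop, samp, by = "Gender") %>%
  mutate(Pop_Perc  = CountStu_Pop  / sum(CountStu_Pop),
         Samp_Perc = CountStu_Samp / sum(CountStu_Samp),
         NonRep_wt = Pop_Perc / Samp_Perc)
wts$NonRep_wt  # ~0.88 for FEMALE, ~1.18 for MALE
```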

GEN_ETH_wt %>% 
  arrange(desc(NonRep_wt)) %>% 
  round_df(., 3) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Gender RACEgroup NonRep_wt
MALE White 1.254
MALE Other 1.091
MALE LatinX 1.009
MALE Asian 0.971
FEMALE White 0.921
FEMALE Other 0.853
FEMALE Asian 0.822
FEMALE LatinX 0.670

STEM vs Rest Respondents

Below is a comparison of the counts of students who fall within each category.

Compare response rates

Demographic overview

plotDEMO <- function(col_name, TitleNM){
  col_name <- enquo(col_name)

  WFU_stu %>% 
    # select(2:9, STEM, RACEgroup) %>% 
  left_join(., GEN_ETH_wt, by = c("GENDER" = "Gender", "RACEgroup" = "RACEgroup")) %>% 
  group_by(STEM, !!col_name) %>% 
  summarise(WeightedResp = sum(NonRep_wt, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_by(STEM) %>% 
  mutate(TotalRes = sum(WeightedResp),
         Percent = WeightedResp/TotalRes) %>% 
  ggplot(aes(x=!!col_name, y=Percent))+
    geom_bar(stat = "identity", aes(fill=!!col_name))+
    geom_text(aes(label = scales::percent(Percent),
                   y= Percent ), stat= "identity", vjust = -.5) + #, size=3
    theme_bw()+
    scale_fill_viridis(discrete=TRUE) +
    facet_grid( ~ STEM) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks=seq(0, 1, by=0.2)) + 
    expand_limits(y=c(0,1)) + 
    ggtitle(paste0(TitleNM, " percentage by STEM group [Adjusted for non-response]"))
  }

plotDEMO(GENDER, "Gender")    

plotDEMO(RACEgroup, "Aggregated Ethnicity")    

plotDEMO(YEAR, "Student Cohort")    

plotDEMO(FIRST_GEN, "First Generation") 

suppressWarnings(suppressMessages(plotDEMO(DISTANCE, "Distance from Home")))

suppressWarnings(suppressMessages(plotDEMO(HS_GPA, "HS GPA"))) 

plotDEMO(COLLEGE_GPA, "College GPA")

Responses to ACT Questions

These questions asked students about the frequency with which they performed certain activities in the past year. The questions spanned activities outside the classroom environment as well as inside the classroom.

Scale
1: Not at all
2: Occasionally
3: Frequently

Based on my personal bias (as a former STEM undergrad), I would expect the questions most likely to be positively associated with STEM selection to be:

ACT09: Look up scientific research articles and resources
ACT13: Write computer code

plotDEMO(ACT01, "ACT01") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT02, "ACT02") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT03, "ACT03") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT04, "ACT04") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT05, "ACT05") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT06, "ACT06") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT07, "ACT07") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT08, "ACT08") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT09, "ACT09") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT10, "ACT10") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT11, "ACT11") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT12, "ACT12") + xlab("Score") + theme(legend.position = "none")

plotDEMO(ACT13, "ACT13") + xlab("Score") + theme(legend.position = "none")

Responses to RATE Questions

These questions asked students about their perceived competence in a few areas. They fell into two general categories: 1) student perception of interpersonal interactions, and 2) student perception of their own academic competencies (like mathematical or writing ability).

Scale
1: A major weakness
2: Somewhat weak
3: Average
4: Somewhat strong
5: A major strength

Based on my personal bias, I would expect the question most likely to be positively associated with STEM selection to be:

RATE6: Mathematical ability

I would also expect the stereotype of poor writing ability (RATE7) to potentially result in a negative association with STEM.

plotDEMO(RATE1, "RATE1") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE2, "RATE2") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE3, "RATE3") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE4, "RATE4") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE5, "RATE5") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE6, "RATE6") + xlab("Score") + theme(legend.position = "none")

plotDEMO(RATE7, "RATE7") + xlab("Score") + theme(legend.position = "none")

Responses to AGREE Questions

These questions asked students about how they would assess the value of certain future goals or their perception of what community they belong to.

Scale
1: Strongly disagree
2: Disagree
3: Slightly disagree
4: Slightly agree
5: Agree
6: Strongly agree

Based on my personal bias, I would expect the questions most likely to be positively associated with STEM selection to be:

AGREE_01: I have a strong sense of belonging to a community of scientists
AGREE_02: I think of myself as a scientist
AGREE_03: I feel like I belong in the field of science
AGREE_08: Making a theoretical contribution to science is important to me

plotDEMO(AGREE_01, "AGREE_01") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_02, "AGREE_02") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_03, "AGREE_03") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_04, "AGREE_04") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_05, "AGREE_05") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_06, "AGREE_06") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_07, "AGREE_07") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_08, "AGREE_08") + xlab("Score") + theme(legend.position = "none")

plotDEMO(AGREE_09, "AGREE_09") + xlab("Score") + theme(legend.position = "none")

Missingness Notes

Based on the descriptive statistics review of the variables, the missing data for "DISTANCE" and "HS_GPA" led me to exclude those two variables from the model below.

Model

Develop Model for Major

The goal of the model is to predict a binary categorical outcome (is the student likely to choose STEM as their major or not). These types of models are generally called "classification models", and two commonly used ones are logistic regressions and random forests.


Logistic Regression

In this algorithm, the log-odds of the outcome of a single trial are modelled as a linear function of the predictors. This algorithm is part of the generalized linear model (glm) family, of which linear regression is also a member.

Advantages: Logistic regression was designed for classification, and is very useful for understanding the influence of several independent variables on a single outcome variable.

Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes the data is free of missing values.
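As a toy illustration (simulated data, not this survey), the log-odds relationship can be seen directly with base R's `glm()`:

```r
# Simulate a binary outcome whose log-odds are linear in x:
# log(p / (1 - p)) = 0.5 + 1.2 * x
set.seed(42)
x <- rnorm(500)
p <- 1 / (1 + exp(-(0.5 + 1.2 * x)))
y <- rbinom(500, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)
coef(fit)       # estimates of the intercept and slope, on the log-odds scale
exp(coef(fit))  # exponentiating gives odds ratios (used later in this report)
```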


Random Forest

A random forest classifier fits a number of decision trees on bootstrap sub-samples of the dataset and aggregates their votes to improve predictive accuracy and control over-fitting.

Advantages: Reduced over-fitting, and a random forest classifier is more accurate than a single decision tree in most cases.

Disadvantages: Slow real-time prediction, difficult to interpret, and a more complex algorithm.
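A minimal sketch of the same engine used below, fit to R's built-in iris data rather than the survey:

```r
library(randomForest)

# 500 trees, each grown on a bootstrap sample of the data with a random subset of
# predictors tried at each split; classes are predicted by majority vote.
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$confusion            # out-of-bag confusion matrix (one row per class)
predict(rf, head(iris)) # predicted classes for the first few rows
```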


Model Overview

Training and Test data

When developing models, we often split the data into "training" and "test" datasets. This allows us to evaluate the model's accuracy on data that was not used in its creation.
In the case of this model we have utilized a 75/25 (training/test) split.

Model Performance

  • Accuracy: (True Positives + True Negatives) / Total Population
  • Precision: "how sensitive the model is to False Positives" (i.e. predicting a student will choose STEM when he/she actually does not)
  • Recall: "how sensitive the model is to False Negatives" (i.e. predicting a student will not choose STEM when he/she actually does)

All of these outputs can be calculated from a confusion matrix, which lets us see how many students are correctly assigned (true positives and true negatives) versus how many aren't when we apply our model to the test dataset.
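These metrics reduce to simple ratios over the four confusion-matrix cells; with hypothetical counts and "YES-STEM" as the positive class:

```r
# Hypothetical confusion-matrix cells (positive class = "YES-STEM")
tp <- 40   # predicted STEM, actually STEM
fp <- 5    # predicted STEM, actually not STEM  (hurts precision)
fn <- 8    # predicted not STEM, actually STEM  (hurts recall)
tn <- 147  # predicted not STEM, actually not STEM

accuracy  <- (tp + tn) / (tp + fp + fn + tn)  # 0.935
precision <- tp / (tp + fp)                   # ~0.889
recall    <- tp / (tp + fn)                   # ~0.833
```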


WFU_model <- WFU_stu %>% 
  # mutate(STEM = case_when(MAJOR == 6 ~"YES", TRUE ~ "NO")) %>% 
  # select(-MAJOR, -STUID, -YEAR, -DISTANCE, -HS_GPA) %>% 
  select(STEM, GENDER, RACEgroup, COLLEGE_GPA, FIRST_GEN, 
         ACT02, ACT08, ACT09, ACT12, ACT13, 
         RATE3, RATE5, RATE7, 
         AGREE_01, AGREE_02, AGREE_03, AGREE_08, AGREE_06, AGREE_07
         ) %>%
  mutate(GENDER = as_factor(GENDER),
         STEM = as_factor(STEM))

# Using tidymodels structure ---------

train_test_split <-
  rsample::initial_split(
    data = WFU_model,     
    prop = 0.75   
  ) 

# train_test_split

train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()

recipe_prepped <- recipe_simple(dataset = train_tbl)

train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked  <- bake(recipe_prepped, new_data = test_tbl)

Logistic Regression

# logGLM <- glm(STEM ~  ., data=train_baked, family="binomial")

logistic_glm <-
  logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(STEM ~ ., data = train_baked)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# logistic_glm

predictions_glm <- logistic_glm %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(STEM))




predictions_glm %>%
  metrics(STEM, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>% 
    round_df(., 3)
# Precision shows how sensitive the model is to False Positives (i.e. predicting a student will choose STEM when he-she actually does not), whereas 
# Recall looks at how sensitive the model is to False Negatives (i.e. predicting a student will not choose STEM whilst he-she in fact does).

tibble(
  "precision" =
    precision(predictions_glm, STEM, .pred_class) %>%
    select(.estimate),
  "recall" = 
    recall(predictions_glm, STEM, .pred_class) %>%
    select(.estimate)
) %>%
  unnest() %>%
  round_df(., 3) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
precision recall
0.972 0.963
predictions_glm %>% 
  mutate(Prediction = ifelse(.pred_class == "YES-STEM", 1, 0)) %>% 
  arrange(Prediction) %>%  
  mutate(ord = row_number()) %>% 
  ggplot(aes(x=ord, y=Prediction))+
    geom_point(aes(color=STEM), size =2, alpha = .5)+
    ylab("Predicted STEM")+
    xlab("")+
    scale_y_continuous(breaks=seq(0, 1, by=1)) + 
    expand_limits(y=c(0,1)) + 
    theme_bw()+
    viridis::scale_color_viridis(discrete=TRUE)+
    ggtitle("Logistic Regression Plot [Prediction vs Actual for Test data]")

Odds Ratio

Utilizing the odds ratio to interpret the elements of the logistic model.

tidy(logistic_glm) %>% 
  mutate(pval = case_when(p.value < 0.05 ~ "<0.05", TRUE ~ ">0.05"),
         Odds = exp(estimate)) %>% 
  arrange(desc(abs(Odds))) %>% 
    round_df(3) %>% 
  ggplot(aes(x=fct_reorder(term, (estimate)), y=estimate))+
    geom_point(aes(color=pval))+
    geom_segment(aes(x = term, y = 0, xend = term, yend = estimate), color = "grey50") +
    geom_hline(yintercept = 0)+
    coord_flip()+
    theme_bw()+
    viridis::scale_color_viridis(discrete = TRUE)+
    xlab("")+
    ggtitle("Coefficient Plot: Logistic Regression")

tidy(logistic_glm) %>% 
  mutate(pval = case_when(p.value < 0.05 ~ "<0.05", TRUE ~ ">0.05"),
         Odds = exp(estimate)) %>% 
  filter(pval == "<0.05") %>%
  round_df(2) %>% 
  rename(LogOdds = estimate) %>%
  arrange(desc(abs(Odds))) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", full_width = F))
term LogOdds std.error statistic p.value pval Odds
AGREE_016 33.68 4.98 6.76 0.00 <0.05 4.235178e+14
RACEgroupAsian 26.43 3.42 7.72 0.00 <0.05 2.996585e+11
AGREE_086 23.97 5.07 4.73 0.00 <0.05 2.583361e+10
AGREE_015 22.38 3.00 7.45 0.00 <0.05 5.250958e+09
AGREE_026 20.64 3.25 6.35 0.00 <0.05 9.191668e+08
AGREE_035 17.73 2.42 7.33 0.00 <0.05 5.011711e+07
AGREE_025 17.48 2.38 7.34 0.00 <0.05 3.889190e+07
AGREE_014 17.36 2.31 7.51 0.00 <0.05 3.477342e+07
GENDERMALE 15.18 1.93 7.86 0.00 <0.05 3.932263e+06
AGREE_085 14.97 2.04 7.35 0.00 <0.05 3.167682e+06
AGREE_084 12.55 1.72 7.29 0.00 <0.05 2.826291e+05
AGREE_034 12.17 1.78 6.83 0.00 <0.05 1.920428e+05
AGREE_013 12.04 1.62 7.44 0.00 <0.05 1.698043e+05
AGREE_024 11.81 1.68 7.04 0.00 <0.05 1.350137e+05
AGREE_033 9.49 1.39 6.81 0.00 <0.05 1.328492e+04
AGREE_083 7.94 1.20 6.63 0.00 <0.05 2.803020e+03
AGREE_023 7.64 1.16 6.57 0.00 <0.05 2.070910e+03
FIRST_GENYES 7.22 1.11 6.53 0.00 <0.05 1.368710e+03
RACEgroupWhite 6.44 1.15 5.61 0.00 <0.05 6.263100e+02
AGREE_082 5.13 0.93 5.50 0.00 <0.05 1.683500e+02
AGREE_012 4.26 0.80 5.32 0.00 <0.05 7.078000e+01
AGREE_032 4.26 0.82 5.17 0.00 <0.05 7.067000e+01
AGREE_022 3.52 0.79 4.48 0.00 <0.05 3.381000e+01
ACT123 -1.71 0.83 -2.06 0.04 <0.05 1.800000e-01
RATE34 -2.02 0.94 -2.14 0.03 <0.05 1.300000e-01
AGREE_072 -2.07 0.69 -3.01 0.00 <0.05 1.300000e-01
(Intercept) -42.89 18.37 -2.34 0.02 <0.05 0.000000e+00
RATE55 -6.72 1.76 -3.83 0.00 <0.05 0.000000e+00
AGREE_062 -7.54 1.09 -6.91 0.00 <0.05 0.000000e+00
AGREE_063 -20.07 2.59 -7.75 0.00 <0.05 0.000000e+00
AGREE_064 -38.18 5.00 -7.63 0.00 <0.05 0.000000e+00
AGREE_065 -55.86 7.15 -7.81 0.00 <0.05 0.000000e+00
AGREE_073 -7.74 1.20 -6.47 0.00 <0.05 0.000000e+00
AGREE_074 -13.74 1.93 -7.10 0.00 <0.05 0.000000e+00
AGREE_075 -22.74 3.34 -6.82 0.00 <0.05 0.000000e+00
AGREE_076 -32.44 4.52 -7.18 0.00 <0.05 0.000000e+00
# Interpretation: exp(coefficient) is an odds ratio. For example, the GENDERMALE coefficient 
# compares the odds of choosing STEM for males to the odds for females, holding the other predictors constant.

Random Forest

Utilizing a random forest model to classify if a student will be graduating within a STEM major or not.

# WFU_RF <- WFU_stu %>% 
#   # mutate(STEM = case_when(MAJOR == 6 ~"YES", TRUE ~ "NO")) %>% 
#   select(-MAJOR, -STUID, -YEAR, -DISTANCE, -HS_GPA, -RACETHN, -DISTANCEval) %>% 
#   # select(STEM, GENDER, RACETHN, COLLEGE_GPA, FIRST_GEN, ACT09, ACT12, ACT13, RATE5, RATE7, AGREE_01, AGREE_02, AGREE_03, AGREE_08) %>% 
#   mutate(GENDER = as_factor(GENDER),
#          STEM = as_factor(STEM))
# 
# rf_train_test_split <-
#   rsample::initial_split(
#     data = WFU_RF,     
#     prop = 0.80   
#   ) 
# 
# # train_test_split
# 
# rf_train_tbl <- rf_train_test_split %>% training() 
# rf_test_tbl  <- rf_train_test_split %>% testing()
# 
# rf_recipe_prepped <- recipe_simple(dataset = rf_train_tbl)
# 
# rf_train_baked <- bake(recipe_prepped, new_data = rf_train_tbl)
# rf_test_baked  <- bake(recipe_prepped, new_data = rf_test_tbl)

rf_build <- 
  rand_forest(trees = 1000, mtry = 6, mode = "classification") %>%
  set_engine("randomForest", importance=T, localImp = T) %>%
  fit(STEM ~ ., data = train_baked)


predictions_rf <- rf_build %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(STEM))


predictions_rf %>%
  metrics(STEM, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>% 
    round_df(., 3) %>% 
  kable(align='c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
.metric .estimate
accuracy 0.856
impt_frame <- measure_importance(rf_build$fit)
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plot_multi_way_importance(impt_frame, no_of_labels = 10)

plot_multi_way_importance(impt_frame, x_measure = "accuracy_decrease", y_measure = "gini_decrease", size_measure = "p_value", no_of_labels = 10)
## Warning: Using alpha for a discrete variable is not advised.

#variable depth
md_frame <- min_depth_distribution(rf_build$fit)
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plot_min_depth_distribution(md_frame, mean_sample = "top_trees") # default mean_sample arg

Confusion Matrices

The confusion matrix allows us to show (based on the test data) how often we correctly predict each category.

When we compare the logistic regression to the random forest, we see that the logistic regression model performs better. The logistic regression model also has the added benefit of being significantly easier to interpret than the random forest.

# Confusion matrix for Logistic
predictions_glm %>%
  conf_mat(STEM, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)+
  theme_bw()+
  ggtitle("Confusion Plot: Logistic Model")





# Confusion matrix for RF
predictions_rf %>%
  conf_mat(STEM, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)+
  theme_bw()+
  ggtitle("Confusion Plot: Random Forest Model")

Takeaways/ Application

How to use the model

So what are the student characteristics that make a student more likely to choose STEM as their major?

  • From a demographic standpoint, they are more likely to be:
    • male
    • White or Asian
    • students with a fairly high college GPA
  • From a student-characteristics standpoint, they are more likely to:
    • believe themselves to be part of a scientific community (AGREE_01)
    • feel like they belong in the field of science (AGREE_03)
    • want to make theoretical contributions to scientific fields (AGREE_08)
    • think of themselves as a scientist (AGREE_02)

If we want to optimize the current model

  1. Would probably want to recruit from scientific communities.
  2. With marketing, we could highlight Wake's strong STEM majors in communities known to be predominantly White/Asian.

If we want to correct for the reasons behind the predictive elements

  1. Start interviewing female students and students from other ethnicities to explore why they didn't choose STEM.
  2. Investigate why developing a life philosophy (AGREE_06) or being a community leader (AGREE_07) isn't important to students when choosing STEM. This could be more about how STEM is culturally understood.

Research Items

  1. Seeing that "College GPA" has a fairly large impact in the model, a next step in this investigation would be to see whether the types of classes that students took impacted their GPA.
  2. Leading vs lagging predictors: i.e. it is not clear whether students chose STEM because they already had a scientific community, or whether they felt they had a STEM community because they joined STEM.

Limitations

  • Did not review the students who changed their major later in their academic career, who could serve as test cases for how to transition the current population into STEM
  • The data doesn't show which students graduate within STEM. This would help us understand whether the support structure for those majors could be improved.
  • Assuming that the counts of students from the fact book are representative of the students from the survey (start of year vs 2nd semester)
  • The survey response distribution does not match the student population
  • Did not characterize the students who were admitted but did not enroll, which could be an opportunity to quickly increase STEM enrollment
 

A work by Phil Walker

philwalker2012@gmail.com